Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upstairs state machine refactoring (3/3) #1577

Merged
merged 37 commits into from
Jan 13, 2025
Merged

Conversation

mkeeter
Copy link
Contributor

@mkeeter mkeeter commented Nov 26, 2024

(staged on top of #1570)

This PR refactors the DsState type into more specific, data-bearing types. Here's a quick look at the new types:

pub enum DsState {
    /// New connection
    Connecting {
        state: NegotiationState,
        mode: ConnectionMode,
    },

    /// Ready for and/or currently receiving IO
    Active,

    /// This downstairs is undergoing LiveRepair
    LiveRepair,

    /// The IO task for the client is being stopped
    Stopping(ClientStopReason),
}

pub enum ConnectionMode {
    Offline,
    New,
    Faulted,
    Replaced,
}

pub enum NegotiationState {
    /// Initial state, waiting to hear `YesItsMe` from the client
    ///
    /// Once this message is heard, transitions to either `WaitActive` (if
    /// `auto_promote` is `false`) or `WaitQuorum` (if `auto_promote` is `true`)
    Start {
        auto_promote: bool,
    },

    /// Waiting for activation by the guest
    WaitActive,

    /// Waiting for the minimum number of downstairs to be present.
    WaitQuorum,

    WaitForPromote,
    WaitForRegionInfo,
    GetLastFlush,
    GetExtentVersions,

    /// Initial startup, downstairs are repairing from each other.
    Reconcile,

    /// Waiting for live-repair to begin
    LiveRepairReady,
}

Many data members of the DownstairsClient – which were only valid for a single state – are now data within a specific variant, e.g. auto_promote and negotiation_state. This makes the code both more precise and harder to mess up.

+400 of this change is updating the OpenAPI spec, which includes DsState. I don't think anyone else is expecting it to be a particular shape, but let me know if I'm wrong (also, it already changed in #1570, so warn me there as well!).

Remaining work:

  • Answer XXX questions, which are places that I wasn't sure about existing behavior
  • Figure out what to do about checked_state_transition, which I just deleted for the moment.
    • Manually specifying valid state transitions and manually writing a bunch of good + bad test cases is a lot of toil; could we be smarter about this?
    • How do we handle nested state machines (negotiation_state within DsState::Connecting?)
  • Update state machine diagrams

@mkeeter mkeeter force-pushed the mkeeter/lean-mean-state-machine branch from 79652b6 to 93a6788 Compare December 2, 2024 14:44
@mkeeter mkeeter changed the title Upstairs state machine refactoring Upstairs state machine refactoring (3/3) Dec 2, 2024
@mkeeter mkeeter force-pushed the mkeeter/lean-mean-state-machine branch 3 times, most recently from d23dd68 to 3835f2a Compare December 9, 2024 15:09
@mkeeter mkeeter force-pushed the mkeeter/lean-mean-state-machine branch from 3835f2a to 43a1e30 Compare December 10, 2024 16:39
@mkeeter mkeeter changed the base branch from main to mkeeter/explicit-stop-state December 11, 2024 14:53
@mkeeter mkeeter force-pushed the mkeeter/lean-mean-state-machine branch 3 times, most recently from 448a57a to 958627d Compare December 11, 2024 16:03
@mkeeter mkeeter marked this pull request as ready for review December 12, 2024 18:03
Copy link
Contributor

@leftwo leftwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far, just questions to help me understand.
I reviewed up to is_state_transition_valid() in upstairs/src/client.rs
I wanted to post this much so it would be recorded and I'll continue and either
post more later, or post that I'm done.

cmon/src/main.rs Show resolved Hide resolved
upstairs/src/client.rs Outdated Show resolved Hide resolved
upstairs/src/client.rs Outdated Show resolved Hide resolved
upstairs/src/client.rs Show resolved Hide resolved
upstairs/src/client.rs Outdated Show resolved Hide resolved
upstairs/src/client.rs Outdated Show resolved Hide resolved
upstairs/src/client.rs Show resolved Hide resolved
upstairs/src/client.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@leftwo leftwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to answer all the XXX questions I came across, though maybe my answers are not the best :)

It may be worth considering how we test this, with regards to the toil of adding tests.
I did find that having tests helped me when doing the LiveRepair stuff, and in general having tests gives confidence to future me that I've not broken something that was not obvious from the outside. However, tests without reasons don't really help either.

upstairs/src/client.rs Outdated Show resolved Hide resolved
upstairs/src/client.rs Show resolved Hide resolved
upstairs/src/client.rs Outdated Show resolved Hide resolved
assert!(self.cfg.read_only);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the TODO below here can come out. We are doing the transition early enough, and I think the LRR state is needed.

Copy link
Contributor Author

@mkeeter mkeeter Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you think the LRR state is needed, if it's inevitably skipped for read-only Downstairs? (I agree that it's needed for the read-write case)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could short circuit things yes, but it comes with having read only downstairs have a different state path, and it means we need to special case the RO upstairs to know during downstairs state transitions that it is RO and we can skip some steps.

I don't think it's a huge deal to special case the RO here, but I liked the overall idea that we flow through the same states as much as possible. Perhaps the code is structured differently now that making the short circuit path for RO makes better sense?

There are probably a number of ways that a RO upstairs could do less as well, like not even having to replay jobs, and ignoring dependencies, probably other things as well. Do we even need to "fault" a RO downstairs? It's either there, or it's not.

I guess, once we open this door, we can keep going, it's just a matter of how far we want to go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed – there's no need to change stuff now, but I think I'll leave in the TODO as a hint to future readers.

upstairs/src/client.rs Outdated Show resolved Hide resolved
upstairs/src/downstairs.rs Show resolved Hide resolved
upstairs/src/downstairs.rs Show resolved Hide resolved
@@ -3362,6 +3409,7 @@ impl Downstairs {
"Saw CrucibleError::UpstairsInactive on client {}!",
client_id
);
// XXX should we also change the upstairs state here?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a reasonable question. If the downstairs has decided we are not active, then I would expect other downstairs to come along and report the same. However, I wonder if there could be some problem that a downstairs replacement would fix and by deactivating the upstairs here, we will have closed the door for any repair/replacement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the YouAreNoLongerActive only is sent when there's another activation that results in a take over, specifically a take over from the perspective of that downstairs. It might make sense for this to result in an upstairs state change?

We may actually want to reject repairs or replacements if we receive this message, as this Upstairs is being taken over?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #1600 to track this, but I don't think it should block the PR (because it hasn't changed).

upstairs/src/lib.rs Outdated Show resolved Hide resolved
upstairs/src/upstairs.rs Show resolved Hide resolved
@mkeeter mkeeter force-pushed the mkeeter/lean-mean-state-machine branch from 26e7d50 to 58ed39e Compare January 6, 2025 14:57
@mkeeter mkeeter force-pushed the mkeeter/explicit-stop-state branch from c5f27dd to 8b60106 Compare January 6, 2025 19:17
@mkeeter mkeeter force-pushed the mkeeter/lean-mean-state-machine branch from c76a219 to e3743ae Compare January 6, 2025 19:17
Base automatically changed from mkeeter/explicit-stop-state to main January 6, 2025 19:18
@mkeeter mkeeter force-pushed the mkeeter/lean-mean-state-machine branch from e3743ae to 535de0c Compare January 6, 2025 19:18
@mkeeter mkeeter force-pushed the mkeeter/lean-mean-state-machine branch from e3246e2 to 1698442 Compare January 7, 2025 14:53
Copy link
Contributor

@leftwo leftwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only thing to think about is what to do with version mismatch between upstairs and downstairs and if that should result in a retry (to the same downstairs) or deactivation of the upstairs.

If we deactivate the upstairs, any propolis is essentially dead in the water at that point and won't be able to make progress as we don't have a way to send an upstairs a "activate" message a 2nd time.

upstairs/src/client.rs Show resolved Hide resolved
upstairs/src/client.rs Outdated Show resolved Hide resolved
upstairs/src/client.rs Outdated Show resolved Hide resolved
DsState::Connecting {
mode: ConnectionMode::New,
state: NegotiationState::Start { .. },
} => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this also include

DsState::Connecting {
  mode: ConnectionMode::New,
  state: NegotiationState::WaitActive,
}

to match what was previously there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure – it would require precise timing, but I could believe there's probably a path where one client makes it to WaitActive while others are halting the IO task. Done in ea999fb

upstairs/src/client.rs Outdated Show resolved Hide resolved
};
self.state = DsState::Connecting {
state: NegotiationState::WaitForPromote,
mode,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be anything but ConnectionMode::New in this match?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, good catch! Done in 12c38e1 (which also simplifies Upstairs::set_active_request)


// Restart the IO task for that specific client, transitioning to a new
// state.
self.clients[client_id].reinitialize(up_state, self.can_replay);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noticing that there's a slight behaviour change here: reinitialize now runs before skip_all_jobs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup – I don't think there's any actual difference, though. reinitialize doesn't actually send jobs; it just restarts the client IO task.

@@ -3362,6 +3409,7 @@ impl Downstairs {
"Saw CrucibleError::UpstairsInactive on client {}!",
client_id
);
// XXX should we also change the upstairs state here?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the YouAreNoLongerActive only is sent when there's another activation that results in a take over, specifically a take over from the perspective of that downstairs. It might make sense for this to result in an upstairs state change?

We may actually want to reject repairs or replacements if we receive this message, as this Upstairs is being taken over?

@mkeeter mkeeter merged commit 4584aa3 into main Jan 13, 2025
16 checks passed
@mkeeter mkeeter deleted the mkeeter/lean-mean-state-machine branch January 13, 2025 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants